Service Level Agreement
A Service Level Agreement (SLA) is an agreement between IT Operations and the End Users they serve. SLAs come in several different classes. Currently, ActiveBatch supports the Availability SLA. This type of SLA means that a certain workflow and/or Jobs within the workflow must have successfully completed by a specific time.
While SLA processing includes an alert capability, that is not the sole purpose of marking a workflow or Job as SLA sensitive. When you enable SLA processing on a Plan (workflow) and/or Job, you are indicating that this is a highly important and time-critical part of your business. ActiveBatch’s responsibility is to try to help you avoid an SLA breach, that is, a miss of the deadline by which the Job must have successfully completed.
The deadline time must be chosen carefully. There isn't always a way to speed up a workflow, especially if systems, code, etc. have already been fine-tuned. A 60-minute workflow can't be completed in 30 minutes. This means that your deadline time must be practical. The allowance of any hedge time (for example, a 50-minute Job that’s given 60 minutes to run) is up to you.
Availability SLAs may be specified on Plan or Job objects. This means that multiple deadlines may be in effect.
In the above figure, the SLA time specified is an “absolute deadline”. This type of deadline means that a clock time represents the deadline. For example, 1500 means that this workflow or Job must have successfully completed by 1500 (or 3pm).
Another type of deadline is “relative duration”. This means that the workflow or Job must complete by a deadline computed by adding the duration time to the start time of the workflow/Job instance. If “Duration from RunTime Monitoring” is enabled, the relative duration is taken from that property. Only a single Relative Duration can be specified.
You can specify one or more Absolute Deadlines by clicking on the Add button. This allows you to enter a clock time. If you enter multiple Absolute Deadlines, the next closest clock time that has not expired will be used. “Next closest” is determined by using either the Scheduled time for the instance or the instance creation time.
Absolute Deadlines are best used for a Scheduled Batch, in other words, where the time the batch is to start is well known. So if a batch is to begin at 10am and execute for 30 minutes, a deadline of 11am becomes a simple calculation. Non-deterministic batches can still use an absolute deadline, but the deadline may be harder to achieve or, for that matter, to determine. Let’s examine some Use Cases concerning SLA Absolute Deadline.
Use Case #1. Single Absolute Deadline. PlanA is scheduled to execute at 1000 and complete in 30 minutes. An absolute deadline of 1100 is set.
Success Path: PlanA completes successfully at 1030 and has met its SLA. If PlanA executes again that day, SLA processing will not be provided until the next day (Calendar or Business, as marked).
Failure Path 1: PlanA completes in failure at 1020; the SLA deadline is still in play. PlanA is rerun at 1030 and completes at 1059. PlanA has met its SLA.
Failure Path 2: PlanA starts at 1105. The Plan has instantly breached its SLA (in fact, the Plan will have breached its SLA at 1101). Every time PlanA is instantiated that day and fails, it will be considered to have breached its SLA. The 1100 deadline is still in play until the next day.
Use Case #2. Multiple Absolute Deadlines. PlanA is scheduled to execute at 1000 and 1200 and complete in 30 minutes. Absolute deadlines of 1100 and 1300 are set.
Success Path: PlanA completes successfully at 1030 and has met its SLA. If PlanA executes again at 1200 and completes successfully at 1230, the next SLA deadline of 1300 is used. PlanA has met that deadline as well. If PlanA executes again that day, SLA processing will not be provided until the next day (Calendar or Business, as marked).
Failure Path 1: PlanA completes in failure at 1020; the SLA deadline is still in play. PlanA is rerun at 1030 and completes at 1059. PlanA has met its SLA. When PlanA executes after 1101, the 1300 deadline would be used. Once that 1300 deadline has been successfully met, any future instances for that day are ignored in terms of SLA processing. If the 1300 deadline has not been met, all future instances for that day would be marked as having breached the SLA.
Failure Path 2: PlanA completes in failure at 1020; the SLA deadline at 1100 is still in play. PlanA is restarted at 1105. The Plan has instantly breached its SLA (in fact, the Plan will have breached its SLA at 1101). When PlanA is then manually triggered at 1110, the next deadline at 1300 is used. This illustrates the point that the instance’s creation date/time determines the deadline to use when multiple deadlines in a single day are used. When PlanA is restarted, the original 1100 deadline is used because the original PlanA instance was created at 1000 (which is before 1100 time-wise). When PlanA is manually triggered at 1110, a new instance is created with a time of 1110. This causes the next deadline to be used (i.e. 1300, since 1110 is past 1100).
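The deadline selection in the Use Cases above can be sketched in a few lines of Python. This is only an illustration of the rule as described, not ActiveBatch code: the function name select_deadline is hypothetical, a deadline equal to the reference time is assumed to count as not yet expired, and falling back to the last deadline of the day when every deadline has passed is an assumption based on Use Case #1, Failure Path 2.

```python
from datetime import time

def select_deadline(reference, deadlines):
    """Pick the absolute deadline that applies to an instance.

    reference: the instance's Scheduled time or creation time.
    deadlines: the clock times configured as Absolute Deadlines.
    (Hypothetical helper; names are not part of ActiveBatch.)
    """
    if not deadlines:
        raise ValueError("at least one absolute deadline is required")
    ordered = sorted(deadlines)
    for deadline in ordered:
        # Assumption: a deadline equal to the reference time has not expired.
        if deadline >= reference:
            return deadline
    # Assumption: with every deadline already passed, the last one stays
    # in play (and is already breached) until the next day.
    return ordered[-1]

deadlines = [time(11, 0), time(13, 0)]
print(select_deadline(time(10, 0), deadlines))   # instance created at 1000
print(select_deadline(time(11, 10), deadlines))  # manual trigger at 1110
```

A restart keeps the original instance's creation time (1000 in Failure Path 2), so it would still be checked against the 1100 deadline, while the manual trigger at 1110 creates a new instance and picks up 1300.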
Remedy Thresholds
Remedy Thresholds determine when to act, and what to do, if time begins to get “tight” in terms of meeting the SLA. Remedy Thresholds provide two (2) options: Warning or Critical Alerts, and Take Action. Warning and Critical indicate the severity of an alert or operator response. In the example above, Warning is set to 80% and Critical to 90%. Alerts can be issued when those levels are reached. Take Action refers to a series of actions that ActiveBatch will take to help ensure this workflow meets its SLA time.
For the threshold itself you can specify the period either as a percentage of the SLA duration (the Relative Duration, or the SLA deadline time minus the start time) or as a time period. For example, assume this workflow started at 1200. With a deadline of 1300, the duration is 1 hour (60 mins). An 80% level would be raised at 48 minutes and a 90% level at 54 minutes. Alternatively, you could specify absolute time periods of 45 mins for Warning and 55 mins for Critical. The use of percentages or absolute time periods is up to you.
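The percentage arithmetic in this example can be checked with a short Python sketch. The function name threshold_offsets and its parameters are hypothetical and only mirror the calculation described here; the calendar date is arbitrary.

```python
from datetime import datetime

def threshold_offsets(start, deadline, warning_pct=80, critical_pct=90):
    """Return (warning, critical) offsets from the start time as timedeltas."""
    # SLA duration for an absolute deadline: deadline time minus start time.
    sla_duration = deadline - start
    warning = sla_duration * warning_pct / 100
    critical = sla_duration * critical_pct / 100
    return warning, critical

start = datetime(2024, 1, 1, 12, 0)      # workflow started at 1200
deadline = datetime(2024, 1, 1, 13, 0)   # absolute deadline of 1300
warning, critical = threshold_offsets(start, deadline)
print(warning, critical)  # 0:48:00 0:54:00
```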
SLA Calculations and Actions
The calculation for a single Job is simple enough, since the elapsed time is compared to either the specified Relative Duration or to the calculated remaining time based on the deadline time. An explanation is necessary when dealing with a workflow. For example, let’s say Plan A has a 40 minute Relative Duration and consists of three (3) Jobs, Job E, F and G, which have expected durations of ten (10) minutes each. This would mean that the expected duration of the Plan is 30 minutes. For this example, let’s say that a Warning Alert and Actions have been set at 80%. This calculates to a warning time at the 32 minute mark.
With ActiveBatch SLA alerting, ActiveBatch uses the expected durations in a “forward” manner to help determine whether an SLA will be met. So assume Job E is still executing at the twelve (12) minute mark. This would cause ActiveBatch to fire off a Warning Alert. Why? Because Job E’s 12 minutes would be added to Job F (10 minutes) and Job G (10 minutes), resulting in an expected time of 32 minutes. This approach provides the most upfront time to handle situations concerning SLA. An alternative would be to wait until the 32 minute mark was actually reached. If we did that, you and ActiveBatch would have 8 minutes instead of 28 minutes to react to a possible breach.
Knowing how ActiveBatch calculates the remaining time is important because, with Actions enabled, it’s possible for an alert to be raised and then resolved multiple times. This is expected behavior and means that ActiveBatch’s actions are having a positive impact on the SLA.
Note: ActiveBatch SLA processing makes use of the Average Expected Duration, so if you make any changes to a workflow and/or Job(s) you should clear the Average Duration time in the object. Objects with a zero (0) time duration are not considered when processing a workflow for SLA; in other words, only the actual elapsed time is considered.
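The forward-looking calculation described above can be sketched as follows. This is a minimal illustration, not the product's actual algorithm: the function names are hypothetical, and using the larger of the running Job's elapsed and expected time is an assumption (the text only covers the case where the Job has already overrun its expected duration).

```python
def projected_total(completed_actual, current_elapsed, current_expected, pending_expected):
    """Projected workflow duration in minutes.

    completed_actual: actual minutes of Jobs already finished.
    current_elapsed / current_expected: the running Job's elapsed and
        average expected minutes.
    pending_expected: average expected minutes of Jobs not yet started.
    """
    # Assumption: a Job that has not yet reached its expected duration is
    # still projected at the expected duration. A zero expected duration
    # therefore contributes only the actual elapsed time.
    current = max(current_elapsed, current_expected)
    return sum(completed_actual) + current + sum(pending_expected)

def warning_reached(projected, sla_minutes, warning_pct=80):
    """True when the projected duration reaches the warning threshold."""
    return projected >= sla_minutes * warning_pct / 100

# Plan A: 40 minute Relative Duration; Jobs E, F, G expected at 10 minutes each.
at_5 = projected_total([], 5, 10, [10, 10])    # Job E running for 5 minutes
at_12 = projected_total([], 12, 10, [10, 10])  # Job E running for 12 minutes
print(at_5, warning_reached(at_5, 40))    # 30 False
print(at_12, warning_reached(at_12, 40))  # 32 True
```

Because the projection is recomputed as Jobs progress, it can move above and below the threshold more than once, which matches the alert being raised and then resolved multiple times.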
When a workflow reaches a certain point where you feel it may become “late” that would be the time to begin having ActiveBatch “take actions”. Take Actions is a checkbox that’s part of the Remedy Thresholds. Once enabled, the system will begin taking actions (so you can enable “Take Actions” at the Warning level and then disable them at the Critical level). With “Take Actions” enabled the following actions occur:
- The currently executing Job’s OS priority is increased.
- Any waiting Job’s Queue Priority is increased to SLA level.
- Any Queues that will be used for this Job or workflow will have a “priority fence” erected to allow only SLA sensitive Jobs to proceed to dispatch.
The above steps will be performed iteratively until the workflow completes.
Note: SLA processing is not considered complete until the workflow or Job completes successfully. So if an SLA time is 60 mins and a workflow completes in failure at 40 mins; the SLA is not considered met. In fact, if an SLA Job/workflow fails, an SLA Critical level is asserted regardless of the time. Audits for actions taken can be seen on both the instance and definition level.
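The completion rule in this note can be summarized with a small Python sketch. The function name sla_outcome and the three result labels are hypothetical and chosen only for illustration; times are minutes from the start of the instance.

```python
def sla_outcome(succeeded, finished_at_min, sla_min):
    """Classify a finished instance against its SLA time (hypothetical labels)."""
    if finished_at_min > sla_min:
        # Past the SLA time, the SLA is breached whatever the exit status.
        return "breached"
    if succeeded:
        return "met"
    # A failure before the SLA time does not meet the SLA; a Critical
    # level is asserted regardless of the time.
    return "critical"

print(sla_outcome(False, 40, 60))  # critical
print(sla_outcome(True, 59, 60))   # met
print(sla_outcome(True, 65, 60))   # breached
```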